{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "acf96c82",
   "metadata": {},
   "source": [
    "# 随机森林\n",
    "\n",
    "1. **随机森林基础概念**\n",
    "\n",
    "        随机森林就是集成学习的思想，将多棵树集成的一种算法，它的基本单元是决策树，而它的本质属于机器学习的一大分支——集成学习（Ensemble Learning）。\n",
    "\n",
    "1. 1 **集成学习**\n",
    "\n",
    "        集成学习通过建立几个模型的组合来解决单一预测问题。\n",
    "        \n",
    "        它的工作原理是生成多个分类器/模型，各自独立地学习和做出预测结果。这些预测最后结合成单分类预测。因此优于任何一个单分类做出的预测。\n",
    "        \n",
    "2. **解释**\n",
    "\n",
    "        从直观的角度来解释，每棵决策树都是一个分类器。\n",
    "        假设现在针对的问题是分类问题，对于一个输入样本，N 棵树就会有N 个分类结果。而随机森林集成了所有的分类投票结果，将投票次数最多的类别指定为最终输出。\n",
    "        \n",
    "        所以说随机森林是以决策树为基础的一种更高级的算法。\n",
    "        \n",
    "        像决策树一样，随机森林既可以用于回归，也可以用于分类。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e812e264",
   "metadata": {},
   "source": [
    "# 随机森林的优缺点"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46fac122",
   "metadata": {},
   "source": [
    "# 随机森实战——红酒数据集案例\n",
    "\n",
    "数据里含有 178 个样本分别属于 3 个类别，这些类别已给出。\n",
    "\n",
    "每个样本中含有 13 个特征成分（化学成分），分析确定了 13 种成分的数量，然后对其余葡萄酒进行分析发现该葡萄酒的分类。\n",
    "\n",
    "\n",
    "### 决策树 vs 随机森林"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7c00e9ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split,cross_val_score\n",
    "from sklearn.datasets import load_wine\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c9293050",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Single Tree:0.9305555555555556 Random Forest:0.9722222222222222\n"
     ]
    }
   ],
   "source": [
    "wine = load_wine()\n",
    "X = wine.data\n",
    "y = wine.target\n",
    "\n",
    "Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size=0.4)\n",
    "\n",
    "# 随机森林 vs 决策树\n",
    "\n",
    "clf = DecisionTreeClassifier(random_state=1,criterion='entropy')\n",
    "# n_estimators 森林中树木的数量\n",
    "rfc = RandomForestClassifier(random_state=1,criterion='entropy',n_estimators=25)\n",
    "\n",
    "# 模型拟合\n",
    "clf = clf.fit(Xtrain, ytrain)\n",
    "rfc = rfc.fit(Xtrain, ytrain)\n",
    "\n",
    "# 模型评分\n",
    "score_c = clf.score(Xtest, ytest)\n",
    "score_r = rfc.score(Xtest, ytest)\n",
    "\n",
    "print(\"Single Tree:{}\".format(score_c)\n",
    "      ,\"Random Forest:{}\".format(score_r)\n",
    "     )\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05e83ba7",
   "metadata": {},
   "source": [
    "#### 为了观察更稳定的结果，下面进行十组[交叉验证](https://blog.csdn.net/weixin_42211626/article/details/100064842)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac201c19",
   "metadata": {},
   "source": [
    "**自主法**\n",
    "\n",
    "        自助法是基于自助采样法的检验方法。对于总数为n的样本合集，进行n次有放回的随机抽样，得到大小为n的训练集。\n",
    "        n次采样过程中，有的样本会被重复采样，有的样本没有被抽出过，将这些没有被抽出的样本作为验证集，进行模型验证。\n",
    "        \n",
    "cross_val_score参数设置\n",
    "sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=’warn’, n_jobs=None, verbose=0, fit_params=None, pre_dispatch=‘2*n_jobs’, error_score=’raise-deprecating’)\n",
    "参数：\n",
    "\n",
    "1. **estimator**： 需要使用交叉验证的算法\n",
    "\n",
    "\n",
    "2. **X**： 输入样本数据\n",
    "\n",
    "\n",
    "3. **y**： 样本标签\n",
    "\n",
    "\n",
    "4. **groups**： 将数据集分割为训练/测试集时使用的样本的组标签（一般用不到）\n",
    "\n",
    "\n",
    "5. **cv**： 交叉验证折数或可迭代的次数\n",
    "\n",
    "\n",
    "6. **n_jobs**： 同时工作的cpu个数（-1代表全部）\n",
    "\n",
    "\n",
    "7. **verbose**： 详细程度\n",
    "\n",
    "\n",
    "8. **fit_params**： 传递给估计器（验证算法）的拟合方法的参数\n",
    "\n",
    "\n",
    "9. **pre_dispatch**： 控制并行执行期间调度的作业数量。减少这个数量对于避免在CPU发送更多作业时CPU内存消耗的扩大是有用的。该参数可以是：\n",
    "    1. 没有，在这种情况下，所有的工作立即创建并产生。将其用于轻量级和快速运行的作业，以避免由于按需产生作业而导致延迟\n",
    "    2. 一个int，给出所产生的总工作的确切数量\n",
    "    3. 一个字符串，给出一个表达式作为n_jobs的函数，如’2 * n_jobs\n",
    "\n",
    "\n",
    "10. **error_score**： 如果在估计器拟合中发生错误，要分配给该分数的值（一般不需要指定）\n",
    "\n",
    "\n",
    "11. **scoring**： 交叉验证最重要的就是他的验证方式，选择不同的评价方法，会产生不同的评价结果。具体可用哪些评价指标，官方已给出详细解释，链接：https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter   \n",
    "<img src='img/scoring.png' width=600px />\n",
    "（具体的可以看这篇文章：https://blog.csdn.net/marsjhao/article/details/78678276）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7578459e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD4CAYAAADiry33AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAnqElEQVR4nO3deXxU5dn/8c+VhF1BNqkSkWABCRC2gGwKigLulta6simlaBGXtor2UfHXjbY8FhQqpQgupaJSoNTHugB1RyBABFlURIS4IsgiiCQz1++PM4QQAhlgwoST7/v1mldy5r7POdcM5Dtn7jnnHnN3REQkvFKSXYCIiJQtBb2ISMgp6EVEQk5BLyIScgp6EZGQS0t2ASWpV6+eN27cONlliIgcN5YsWfKVu9cvqa1cBn3jxo3JyclJdhkiIscNM/v4YG0auhERCTkFvYhIyCnoRURCTkEvIhJyCnoRkZBT0IuIhJyCXkQk5MrlefTHq935EZZu+JplG7byXX4k2eVQs1olru98OlUrpSa7FBFJIgX9Udgb7G+v28Lb6zaTu2EreyJRAMySXBzgDm+u/YqJ/TtQJU1hL1JRKegPw+78CMs2bOXtdZt5e91mlm3cyp6CKCkGrRrWYlC3xnRuUofsxnWoWbVSssvlqUUbuHvmCn42bSl/ua4DldM0UidSESnoD6HUYO9avoK9uGs6NSISdf5n9rsM/8dSJlzXnkqpCnuRikZBX8TxHuwlub7z6USizv1zVjLiqWU8dE07hb1IBVOhg353foTcjfuCfemGfcHe8tRaDOxyOl3OqHtcBXtJBnZtTEHU+fVzq7j96VzGXtWWNIW9SIVRoYI+nmDv3CQI9lrVjt9gL8mN3TOIRp3fPr+a1BTjwR+3JTWlHHxiLCJlLtRBX5GDvSQ/OacJBVHnDy+sIdWMP13ZRmEvUgGEKui/K4iQu2Fr4emOSzd8zXcFUcygVQUM9pLc1PMMItEoY156n9QU4w8/zCJFYS+SdO7O5p17qHdClYRvOzRBvzs/Qodfv8zOPRHMoOWpNenfOQj2jhkVN9hLMvy8phREnbFzPyA1xfjdD1or7EWOMXdn3Vc7YyMOwcFp5dQU3hx5XsL3FZqgr1opldsvaEbjujUU7HG4tVdTIlHn4flrSU0xfnNFK6w8XOUlElIlBfumHd8B0KBmFbqeUZfOTeoSiXrCh1RDE/QAQ85ukuwSjhtmxh0XNKMg6jzyyoekpRijLmupsBdJkEMF+8kn7gv2zk3q0rhu9TL924sr6M2sLzAOSAUmu/voYu21gSnAGcBu4AZ3fzfWdjswBHBgBTDY3Xcn7BHIETMz7uzTnEjUmfTaOlJSjPsuyVTYixwBd+ejr3YWhvrb6zbzZZKCvbhSg97MUoEJwAVAHrDYzOa4+6oi3e4Bct39B2Z2Zqx/LzNrCIwAMt39WzN7BrgaeCzBj0OOkJlx94VnUhBxprz5EWkpxj0XtVDYi5SitGDvksRgLy6eI/pOwFp3XwdgZtOBy4GiQZ8J/B7A3deYWWMza1BkH9XMLB+oDnyaqOIlMcyMey9pQSQa5W+vf0RqSgp39W2usBcporRg3xvqnZvUIaNejXL19xNP0DcENhZZzgPOKtbnHaAf8IaZdQJOB9LdfYmZjQE2AN8CL7n7SyXtxMyGAkMBGjVqdFgPQo6eWTBGH3Fn4qvBmP3PezcrV/9ZRY4ld2f95l0s+HDzcRfsxcUT9CVV78WWRwPjzCyXYBx+GVAQG7u/HMgAtgLPmtn17v73AzboPgmYBJCdnV18+3IMmBn/77JWRKLO+P+uJS3VuO38ZskuS+SY2Bvse0P97XWb+WJ7EOz1T6xCl+Mo2IuLJ+jzgNOKLKdTbPjF3bcDgwEsePQfxW59gI/cfVOsbSbQFTgg6KV8SEkxfntFawoisfPszbilV9Nkl3XUIlHn6117iESdgqgTiTgRdyLRKAVRpyDiRKJ7
7yu+HC1cLog6Uff9liPRaJHfY32KLQd9IBKNHnCUJMn39a58Fn0UnmAvLp6gXww0NbMM4BOCD1OvLdrBzE4Cdrn7HoIzbF5z9+1mtgHobGbVCYZuegE5CaxfykBKijH6h1lE3Pnfl98nNdW4uef3k13WEcmPRJm5NI+H568l7+tvj/n+U1MsuJmRlmKkpBi6Nq38qVYplU4ZQah3blKXJsd5sBdXatC7e4GZDQdeJDi9coq7rzSzYbH2iUAL4AkzixB8SHtjrG2hmc0AlgIFBEM6k8rkkUhCpaYYf/pRGyJR548vvEdaijH0nDOSXVbcigd8VnotbuyeQZW0VNL2hm/stnc5LdVIMSMtJaXY8r72tJQiffZbtgOXUyxUYSHHL3Mvf28ks7OzPSdHB/7lQUEkym1P5/Lc8s+495JMbuyekeySDik/EmXW0k94+L8fsHFLEPC3nd+Uc5ufrNCVUDOzJe6eXVJbqK6MlcRLS01h7FVtiXown31aijGwa+Nkl3WAkgL+gctaKuBFUNBLHNJSUxh3dTsi0aXcP2clKSlG/86nJ7ssQAEvEg8FvcSlUmoKD1/TnpunLeHe2e+SlmJc0yl51zso4EXip6CXuFVOS2HCde0Z9uQS7p65glQzftzxtNJXTKD8SJRZyz5h/Py1bNiyi9YNazFqYEvOO1MBL3IwCno5LFXSUnnk+g4MfXIJd81cTmqK8cMO6WW+35IC/tGB2Qp4kTgo6OWwVa2UyqT+HRjyeA6/mPEOqSnGFe0alsm+FPAiR09BL0ekaqVU/jYgmxseW8wdz+SSmmJc2ubUhG2/eMC3aliTyQOy6dVCAS9yuBT0csSqVU7l0UHZDJq6mNueDsL+otanHNU2C/YG/H/X8vFmBbxIIijo5ahUr5zG1EEdGThlESOeWkaKGX1bfe+wt6OAFyk7ujJWEuKb7woY8OhCludt45HrO3BBZoPSV6LkgL+tVzMFvMhhOtSVsQp6SZjtu/Pp/+giVn26jb/278B5Zx487AsiUWbnfsrD8z/g4827aHlqTW47vxnnK+BFjoiCXo6Zbd/mc/3khbz3+Q7+NjCbHs3q79eugBcpGwp6Oaa27trDdZMX8sGX3zBlYEe6N62ngBcpYwp6Oea+3rmHa/72Nh99tZObe36fmcvyFPAiZUizV8oxV7tGZaYNOYtr/7aQP899n8xTajKpf/AhrQJe5NhS0EuZqXtCFZ75aRdWfbadzk3qKOBFkkRBL2WqVvVKdDmjbrLLEKnQUpJdgIiIlC0FvYhIyCnoRURCTkEvIhJycQW9mfU1s/fMbK2ZjSyhvbaZzTKz5Wa2yMxaFWk7ycxmmNkaM1ttZl0S+QBEROTQSg16M0sFJgAXApnANWaWWazbPUCuu2cBA4BxRdrGAS+4+5lAG2B1IgoXEZH4xHNE3wlY6+7r3H0PMB24vFifTGAegLuvARqbWQMzqwmcAzwaa9vj7lsTVbyIiJQunqBvCGwsspwXu6+od4B+AGbWCTgdSAeaAJuAqWa2zMwmm1mNknZiZkPNLMfMcjZt2nSYD0NERA4mnqAv6XLG4hPkjAZqm1kucAuwDCgguCCrPfCIu7cDdgIHjPEDuPskd8929+z69euX1EVERI5APFfG5gGnFVlOBz4t2sHdtwODASy4zv2j2K06kOfuC2NdZ3CQoBcRkbIRzxH9YqCpmWWYWWXgamBO0Q6xM2sqxxaHAK+5+3Z3/xzYaGbNY229gFUJql1EROJQ6hG9uxeY2XDgRSAVmOLuK81sWKx9ItACeMLMIgRBfmORTdwCTIu9EKwjduQvIiLHhuajFxEJgUPNR68rY0VEQk5BLyIScgp6EZGQU9CLiIScgl5EJOQU9CIiIaegFxEJOQW9iEjIKehFREJOQS8iEnIKehGRkFPQi4iEnIJeRCTkFPQiIiGnoBcRCTkFvYhIyCnoRURCTkEvIhJyCnoRkZBT0IuIhJyCXkQk5BT0IiIhF1fQm1lf
M3vPzNaa2cgS2mub2SwzW25mi8ysVbH2VDNbZmbPJapwERGJT6lBb2apwATgQiATuMbMMot1uwfIdfcsYAAwrlj7rcDqoy9XREQOVzxH9J2Ate6+zt33ANOBy4v1yQTmAbj7GqCxmTUAMLN04GJgcsKqFhGRuMUT9A2BjUWW82L3FfUO0A/AzDoBpwPpsbaxwJ1A9FA7MbOhZpZjZjmbNm2KoywREYlHPEFvJdznxZZHA7XNLBe4BVgGFJjZJcCX7r6ktJ24+yR3z3b37Pr168dRloiIxCMtjj55wGlFltOBT4t2cPftwGAAMzPgo9jtauAyM7sIqArUNLO/u/v1CahdRETiEM8R/WKgqZllmFllgvCeU7SDmZ0UawMYArzm7tvd/W53T3f3xrH15ivkRUSOrVKP6N29wMyGAy8CqcAUd19pZsNi7ROBFsATZhYBVgE3lmHNIiJyGMy9+HB78mVnZ3tOTk6yyxAROW6Y2RJ3zy6pTVfGioiEnIJeRCTk4jnrRkRCKj8/n7y8PHbv3p3sUiROVatWJT09nUqVKsW9joJepALLy8vjxBNPpHHjxgRnRkt55u5s3ryZvLw8MjIy4l5PQzciFdju3bupW7euQv44YWbUrVv3sN+BKehFKjiF/PHlSP69FPQiklSpqam0bduWVq1acemll7J169aEbPexxx5j+PDhCdlWUT179qR58+a0bduWtm3bMmPGjITvA2D9+vX84x//SMi2FPQiklTVqlUjNzeXd999lzp16jBhwoRkl1SqadOmkZubS25uLj/60Y/iWqegoOCw9qGgF5FQ6tKlC5988gkAixYtomvXrrRr146uXbvy3nvvAcGRer9+/ejbty9NmzblzjvvLFx/6tSpNGvWjB49evDmm28W3v/xxx/Tq1cvsrKy6NWrFxs2bABg0KBB3HTTTZx77rk0adKEV199lRtuuIEWLVowaNCguOvesmULV1xxBVlZWXTu3Jnly5cDMGrUKIYOHUrv3r0ZMGAAmzZt4oc//CEdO3akY8eOhTW++uqrhe8Q2rVrx44dOxg5ciSvv/46bdu25c9//vNRPa8660ZEAHjg3ytZ9en2hG4z89Sa3H9py7j6RiIR5s2bx403BjOonHnmmbz22mukpaUxd+5c7rnnHv75z38CkJuby7Jly6hSpQrNmzfnlltuIS0tjfvvv58lS5ZQq1Ytzj33XNq1awfA8OHDGTBgAAMHDmTKlCmMGDGC2bNnA/D1118zf/585syZw6WXXsqbb77J5MmT6dixI7m5ubRt2/aAWq+77jqqVasGwLx58xg1ahTt2rVj9uzZzJ8/nwEDBpCbmwvAkiVLeOONN6hWrRrXXnstt99+O927d2fDhg306dOH1atXM2bMGCZMmEC3bt345ptvqFq1KqNHj2bMmDE899zRfzGfgl5Ekurbb7+lbdu2rF+/ng4dOnDBBRcAsG3bNgYOHMgHH3yAmZGfn1+4Tq9evahVqxYAmZmZfPzxx3z11Vf07NmTvdOcX3XVVbz//vsALFiwgJkzZwLQv3///d4FXHrppZgZrVu3pkGDBrRu3RqAli1bsn79+hKDftq0aWRn75tt4I033ih8ETrvvPPYvHkz27ZtA+Cyyy4rfFGYO3cuq1atKlxv+/bt7Nixg27dunHHHXdw3XXX0a9fP9LT00kkBb2IAMR95J1oe8fot23bxiWXXMKECRMYMWIE9957L+eeey6zZs1i/fr19OzZs3CdKlWqFP6emppaOP4d7xkpRfvt3VZKSsp+201JSYl7XL2kOcP27qNGjRqF90WjURYsWFAY/HuNHDmSiy++mOeff57OnTszd+7cuPYbL43Ri0i5UKtWLR566CHGjBlDfn4+27Zto2HD4MvsHnvssVLXP+uss3jllVfYvHkz+fn5PPvss4VtXbt2Zfr06UBwNN69e/eE1n7OOecwbdo0AF555RXq1atHzZo1D+jXu3dvxo8fX7i8d3jnww8/
pHXr1tx1111kZ2ezZs0aTjzxRHbs2JGQ+hT0IlJutGvXjjZt2jB9+nTuvPNO7r77brp160YkEil13VNOOYVRo0bRpUsXzj//fNq3b1/Y9tBDDzF16lSysrJ48sknGTduXELrHjVqFDk5OWRlZTFy5Egef/zxEvs99NBDhf0yMzOZOHEiAGPHjqVVq1a0adOGatWqceGFF5KVlUVaWhpt2rQ56g9jNU2xSAW2evVqWrRokewy5DCV9O+maYpFRCowBb2ISMgp6EVEQk5BLyIScgp6EZGQU9CLiIRcXEFvZn3N7D0zW2tmI0tor21ms8xsuZktMrNWsftPM7P/mtlqM1tpZrcm+gGIyPFt7zTFLVu2pE2bNjz44INEo9Ej2tZ99913yKtKJ06cyBNPPHGkpQKwYsWKwgnI6tSpQ0ZGBm3btuX8888/qu2WKXc/5A1IBT4EmgCVgXeAzGJ9/gTcH/v9TGBe7PdTgPax308E3i++bkm3Dh06uIiUvVWrViW7BK9Ro0bh71988YX36tXL77vvviRWFL+BAwf6s88+e8D9+fn5Zbrfkv7dgBw/SKbGc0TfCVjr7uvcfQ8wHbi8WJ9MYF7shWMN0NjMGrj7Z+6+NHb/DmA10PBIXpBEJPxOPvlkJk2axPjx43F3IpEIv/zlL+nYsSNZWVn89a9/Lez7xz/+kdatW9OmTRtGjgwGGgYNGlT4RSAjR44kMzOTrKwsfvGLXwDBFaxjxowBgukHOnfuTFZWFj/4wQ/4+uuvgeCLRe666y46depEs2bNeP311+OqvWfPntxzzz306NGDcePGsWTJEnr06EGHDh3o06cPn332GRBMd9C3b186dOjA2WefzZo1axLz5B1CPJOaNQQ2FlnOA84q1ucdoB/whpl1Ak4H0oEv9nYws8ZAO2BhSTsxs6HAUIBGjRrFV72IJM5/RsLnKxK7ze+1hgtHH9YqTZo0IRqN8uWXX/Kvf/2LWrVqsXjxYr777ju6detG7969WbNmDbNnz2bhwoVUr16dLVu27LeNLVu2MGvWLNasWYOZlfitVQMGDODhhx+mR48e3HfffTzwwAOMHTsWCL4kZNGiRTz//PM88MADcU8ytnXrVl599VXy8/Pp0aMH//rXv6hfvz5PP/00v/rVr5gyZQpDhw5l4sSJNG3alIULF3LzzTczf/78w3qODlc8QV/SdHDF500YDYwzs1xgBbAMKJz2zcxOAP4J3ObuJU547e6TgEkQTIEQR10iElIem5rlpZdeYvny5YVH6du2beODDz5g7ty5DB48mOrVqwNQp06d/davWbMmVatWZciQIVx88cVccskl+7Vv27aNrVu30qNHDwAGDhzIlVdeWdjer18/ADp06MD69evjrvuqq64C4L333uPdd98tnHI5Eolwyimn8M033/DWW2/tt6/vvvsu7u0fqXiCPg84rchyOvBp0Q6x8B4MYMHcnB/FbphZJYKQn+buMxNQs4iUhcM88i4r69atIzU1lZNPPhl35+GHH6ZPnz779XnhhRcOOSVxWloaixYtYt68eUyfPp3x48cf1lHz3umKi06BHI+9UxK7Oy1btmTBggX7tW/fvp2TTjqpcNbKYyWeMfrFQFMzyzCzysDVwJyiHczspFgbwBDgNXffHgv9R4HV7v5gIgsXkfDZtGkTw4YNY/jw4ZgZffr04ZFHHin80pH333+fnTt30rt3b6ZMmcKuXbsADhi6+eabb9i2bRsXXXQRY8eOPSBYa9WqRe3atQvH35988snCo/tEaN68OZs2bSoM+vz8fFauXEnNmjXJyMgonELZ3XnnnXcStt+DKfWI3t0LzGw48CLBGThT3H2lmQ2LtU8EWgBPmFkEWAXcGFu9G9AfWBEb1gG4x92fT+zDEJHj1d5vmMrPzyctLY3+/ftzxx13ADBkyBDWr19P+/btcXfq16/P7Nmz6du3L7m5uWRnZ1O5cmUuuugifve73xVuc8eOHVx++eXs3r0b
dy9xmt/HH3+cYcOGsWvXLpo0acLUqVMT9pgqV67MjBkzGDFiBNu2baOgoIDbbruNli1bMm3aNG666SZ+85vfkJ+fz9VXX02bNm0Stu+SaJpikQpM0xQfnzRNsYiI7EdBLyIScgp6EZGQU9CLVHDl8XM6Obgj+fdS0ItUYFWrVmXz5s0K++OEu7N582aqVq16WOvFc8GUiIRUeno6eXl5bNq0KdmlSJyqVq1Kenr6Ya2joBepwCpVqkRGRkayy5AypqEbEZGQU9CLiIScgl5EJOQU9CIiIaegFxEJOQW9iEjIKehFREJOQS8iEnIKehGRkFPQi4iEnIJeRCTkFPQiIiGnoBcRCTkFvYhIyCnoRURCLq6gN7O+Zvaema01s5EltNc2s1lmttzMFplZq3jXFRGRslVq0JtZKjABuBDIBK4xs8xi3e4Bct09CxgAjDuMdUVEpAzFc0TfCVjr7uvcfQ8wHbi8WJ9MYB6Au68BGptZgzjXFRGRMhRP0DcENhZZzovdV9Q7QD8AM+sEnA6kx7kusfWGmlmOmeXo+ytFRBInnqC3Eu4r/pXxo4HaZpYL3AIsAwriXDe4032Su2e7e3b9+vXjKEtEROIRz5eD5wGnFVlOBz4t2sHdtwODAczMgI9it+qlrSsiImUrniP6xUBTM8sws8rA1cCcoh3M7KRYG8AQ4LVY+Je6roiIlK1Sj+jdvcDMhgMvAqnAFHdfaWbDYu0TgRbAE2YWAVYBNx5q3bJ5KCIiUhJzL3HIPKmys7M9Jycn2WWIiBw3zGyJu2eX1KYrY0VEQk5BLyIScgp6EZGQU9CLiIScgl5EJOQU9CIiIaegFxEJOQW9iEjIKehFREJOQS8iEnIKehGRkFPQi4iEnIJeRCTkFPQiIiGnoBcRCTkFvYhIyCnoRURCTkEvIhJyCnoRkZBT0IuIhJyCXspWNALffJnsKkQqtLiC3sz6mtl7ZrbWzEaW0F7LzP5tZu+Y2UozG1yk7fbYfe+a2VNmVjWRD0DKsfzdMO1KeDATFvwF3JNdkUiFVGrQm1kqMAG4EMgErjGzzGLdfgascvc2QE/gf82sspk1BEYA2e7eCkgFrk5g/VJeFXwHz/SHD+fBqW3hxbth+nXw7dfJrkykwonniL4TsNbd17n7HmA6cHmxPg6caGYGnABsAQpibWlANTNLA6oDnyakcim/CvbAMwPhg5fg0nFw48vQ53fwwYsw8RzIW5LsCkUqlHiCviGwschyXuy+osYDLQhCfAVwq7tH3f0TYAywAfgM2ObuL5W0EzMbamY5ZpazadOmw3wYUm5E8mHGYHj/P3Dxg9BhEJhBl5/BDS8CDlP6aChH5BiKJ+ithPuK/4X2AXKBU4G2wHgzq2lmtQmO/jNibTXM7PqSduLuk9w9292z69evH2f5Uq5E8mHGDbDmObjwT9Dxxv3b07Php69B0wuCoZynr9dQjsgxEE/Q5wGnFVlO58Dhl8HATA+sBT4CzgTOBz5y903ung/MBLoefdlS7kQKYOZQWD0H+vwezhpacr/qdeDqf0Dv38L7L8Bfz4FPNJQjUpbiCfrFQFMzyzCzygQfps4p1mcD0AvAzBoAzYF1sfs7m1n12Ph9L2B1ooqXciIagdnDYOVM6P0b6HLzofubQdfhMPiFYPjm0T7w9kQN5YiUkVKD3t0LgOHAiwQh/Yy7rzSzYWY2LNbt10BXM1sBzAPucvev3H0hMANYSjB2nwJMKoPHIckSjcDsm2HFs3D+KOh6S/zrntYxGMr5/vnwwl2xoZytZVWpSIVlXg6PorKzsz0nJyfZZUhpolGYcwvk/h3O+x8455dHth13WDAB5t4PNRvClVOhYYfE1ir75OXAor9BjXrQuDs06gLVTkp2VXKUzGyJu2eX2KaglyMSjcJzt8LSJ6Dn3dDzgOvoDt/GRfDsYPjmC+jzW+g0NBjmkcTIy4FXRsPal6FKLSjYDZHvAINTsqDx2cGtUWcF/3FIQS+J5Q7/
dwfkTAmO4s/9VeICedcWmH1T8EFti0vhsvEKnaNVNOCr1YFuI6DjTyAlDT7JgfVvBLeNi4LgtxT4XlZwtN/4bDi9C1StlexHIaVQ0EviuMN/7oRFk6D77dDr/sQfdbvDgvEwd1RsKOcxaNg+sfuoCA4W8FVOKLl//m7IW7wv+PMWQWSPgv84oaCXxHCHF+6GhY8EH7pe8OuyHVrRUM6RyVsCr/w+/oA/mPxvgxeLkoL/lDb7gr9RZwV/OaCgl6PnDi/9T3Ck3fnmYEqDYxG6u7bArGHB9AktLoPLxytUDiZvCbw6Oph64mgC/mAU/OWagl6OjnswjPLm2OCo+sI/Htsj62h031DOSacFQzmntjt2+y/vigd811ug00+gyollu9/8b4sN9SxW8CeRgl6OnDvM/w28Pgayb4SL/zd5wycbFgZTLOz8MriyttNPKvZQTrIC/mAOGfxtiwV/zeTUGGIKejlyr4wOxnvbD4RLxkJKkr+rRkM5wZQRr/wheA6q1YauI5Ib8AcTT/DvPdq31OAsoJS9P4v+Hvt5QJ8ifS2lYr/oo6CXI/Xqn+C/v4G218NlDyc/5PeKRmHBwzD3gYo1lHNAwN8SDKWVt4A/mD27Dgz+aH7itm9FXyT2vhAUe+Gw4i8kseXKJwST7jXqDOmdjst3HAp6OXyvPwjzHoCsq+GKvwR/EOXNhoXBlMg7NwUfDnccEs6juuM94A9mzy74YmVw4Va0IJhOI1oAHvtZeF/R5dh9pfXxaLF1ii8X67NrM3z+brBdS4EGrYIrhk/vEvw88XvJfrZKpaCXw/PmQ/DyvdD6SvjBX8tnyO+1awvM+mkwTp15efDOIyxDOWEN+PLqu2+Cdxkb3oYNC4Lf83cFbbUbB4G/91avabk7qFDQS/wW/CWYK77lD6DfZEhNS3ZFpdtvKKdRbCinbbKrOnKfLIVX/xBcHVytNnQZHgT8cTiccFyL5MPny4Pg//it4Oeur4K26nVjod85+HlKG0itlNRyFfQSn4WT4D+/DD7k/NGUpP/HPWwb3o6dlXOcDuUo4Ms3d9j8IWx4a99R/5Z1QVtatdgYfyz8T+t0zN95KeildIsnw//9HM68JDgiPt5Cfq+dm4OhnLUvB+9KLn2o/AelAv74teOLIPA3vB28AHy+Ihj7txT4Xusiwz2dy3ycX0Evh7bkMfj3rdDsQvjxE5BWOdkVHZ1oFN4aB/N+HQzl/Pjx4K11eVM04KuetG8MXgF//Ppux75x/o/fCq4kLvg2aKudsf8HvHW/n9B3nAp6ObilT8Kc4dC0N1z1d0irkuyKEufjBcFQzq6voO/vgwu+ysNQzqfLgg9Z3/9PLOCHQ6efKuDDKJIPny3ff7hn1+agrXq9fWP8jboEU0UfxTvpihP0j/YJLsg46EUVxc+rTQvODd9vOXbxxWH3KbLt1MrB27aapyb+yUmk3KeCKYHPOBeufgoqVU12RYm3czPMGgpr50K95lCpWnLrieTDlysV8BWVO3z1wf7DPV+vD9oqVYfTzoLrZx7RNSuHCvrj4JSKw3BC/WCq1aLnyebvOcT5tsXPxy1+vu1RXsxR54x9l3037g41T0nM40yE5c8EIZ9xTvBl3WEMeYAadeHaZ2HhRFj3SrKrCbTqpyGaisoM6jcLbh0GBvdt/ww2vh0E/7dfl8mFieE6oi8L0WgpLwZ7L/Ao0mfPzn2z/H38Fny3LdjW3uDPOAdO75a84H/3n/DPIUEN1z4Dlasnpw4RSZiKM3RTHkUjwSfxey/7Lhr8db9f5MscjlHwr5wdjFufdhZcPwMq1yj7fYpImVPQlyfRSHARxn7Bvz1oKxr8jbsn/nSs1f+GZwdBw+wg5HWFpUhoHHXQm1lfYByQCkx299HF2msBfwcaEYz7j3H3qbG2k4DJQCvAgRvcfcGh9hfqoC/ukMHfNBb83Y8++Nc8D8/0Dyb/un6mxodFQuaogt7MUoH3gQuAPGAxcI27ryrS5x6glrvf
ZWb1gfeA77n7HjN7HHjd3SebWWWgurtvPdQ+K1TQF1cWwf/+izD9uuBMoAGzwzMXjIgUOtqzbjoBa919XWxj04HLgVVF+jhwopkZcAKwBSgws5rAOcAgAHffA+w5wsdRMaSkBkfdp7YLLqCJFOwf/CtmwJKpQd+9wZ9xNpzeHU5scOD21s6Fp6+HBi2h/yyFvEgFFE/QNwQ2FlnOA84q1mc8MAf4FDgRuMrdo2bWBNgETDWzNsAS4FZ333nUlVcUqWnQsH1w6zbi0MFfr9m+o/3Tu8OXq+Cpa6F+8yDkq52U1IciIskRT9CXdClh8fGePkAucB5wBvCymb0e23574BZ3X2hm44CRwL0H7MRsKDAUoFGjRvHWX/EcNPhfD4J/+bOQMyXoaylwciYMmAPV6yS3bhFJmniCPg84rchyOsGRe1GDgdEeDPivNbOPgDOBDUCeuy+M9ZtBEPQHcPdJwCQIxujjfgQV3X7Bf2ss+N8JQn/H53D2zxXyIhVcPEG/GGhqZhnAJ8DVwLXF+mwAegGvm1kDoDmwzt2/MrONZtbc3d+L9VmFlJ3UNGjYIbiJiBBH0Lt7gZkNB14kOL1yiruvNLNhsfaJwK+Bx8xsBcFQz13uHpuhn1uAabEzbtYRHP2LiMgxogumRERC4FCnVyZ+9hwRESlXFPQiIiGnoBcRCTkFvYhIyCnoRURCTkEvIhJy5fL0SjPbBHyc7DqOUj3gq1J7VQx6Lvan52N/ej72OZrn4nR3r19SQ7kM+jAws5yDndNa0ei52J+ej/3p+dinrJ4LDd2IiIScgl5EJOQU9GVnUrILKEf0XOxPz8f+9HzsUybPhcboRURCTkf0IiIhp6AXEQk5BX0CmdlpZvZfM1ttZivN7NZk15RsZpZqZsvM7Llk15JsZnaSmc0wszWx/yNdkl1TMpnZ7bG/k3fN7Ckzq5rsmo4lM5tiZl+a2btF7qtjZi+b2Qexn7UTsS8FfWIVAD939xZAZ+BnZpaZ5JqS7VZgdbKLKCfGAS+4+5lAGyrw82JmDYERQLa7tyL4UqOrk1vVMfcY0LfYfSOBee7eFJjHQb569XAp6BPI3T9z96Wx33cQ/CE3TG5VyWNm6cDFwORk15JsZlYTOAd4FMDd97j71qQWlXxpQDUzSwOqc+B3UYeau78GbCl29+XA47HfHweuSMS+FPRlxMwaA+2AhaV0DbOxwJ1ANMl1lAdNgE3A1NhQ1mQzq5HsopLF3T8BxhB83/RnwDZ3fym5VZULDdz9MwgOHIGTE7FRBX0ZMLMTgH8Ct7n79mTXkwxmdgnwpbsvSXYt5UQa0B54xN3bATtJ0Nvy41Fs7PlyIAM4FahhZtcnt6rwUtAnmJlVIgj5ae4+M9n1JFE34DIzWw9MB84zs78nt6SkygPy3H3vO7wZBMFfUZ0PfOTum9w9H5gJdE1yTeXBF2Z2CkDs55eJ2KiCPoHMzAjGYFe7+4PJrieZ3P1ud09398YEH7LNd/cKe8Tm7p8DG82seeyuXsCqJJaUbBuAzmZWPfZ304sK/OF0EXOAgbHfBwL/SsRG0xKxESnUDegPrDCz3Nh997j788krScqRW4BpZlYZWAcMTnI9SePuC81sBrCU4Gy1ZVSwqRDM7CmgJ1DPzPKA+4HRwDNmdiPBi+GVCdmXpkAQEQk3Dd2IiIScgl5EJOQU9CIiIaegFxEJOQW9iEjIKehFREJOQS8iEnL/H63OigiHttVVAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "rfc_l = []\n",
    "clf_l = []\n",
    "\n",
    "for i in range(10):\n",
    "    rfc = RandomForestClassifier(n_estimators=25)\n",
    "    rfc_s = cross_val_score(rfc,X,y,cv=10).mean()\n",
    "    rfc_l.append(rfc_s)\n",
    "    \n",
    "    clf = DecisionTreeClassifier()\n",
    "    clf_s = cross_val_score(clf,X,y,cv=10).mean()\n",
    "    clf_l.append(clf_s)\n",
    "    \n",
    "plt.plot(range(1,11),rfc_l,label='Random Forest')\n",
    "plt.plot(range(1,11),clf_l,label='Decision Tree')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e066c716",
   "metadata": {},
   "source": [
    "# 随机森林算法实战 —— 泰坦尼克号生存预测"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5692abaf",
   "metadata": {},
   "source": [
    "**输入参数**\n",
    "\n",
    "pclass:乘客的仓位（1-一等舱；2-二等舱；3-三等舱）\n",
    "\n",
    "name：乘客姓名\n",
    "\n",
    "sex：性别\n",
    "\n",
    "age：年龄\n",
    "\n",
    "sibsp：兄弟姐妹、伴侣人数\n",
    "\n",
    "parch：父母人数\n",
    "\n",
    "ticket：票号\n",
    "\n",
    "fare：产票价格\n",
    "\n",
    "cabin：船舱号\n",
    "\n",
    "embarked：上船地点\n",
    "\n",
    "**输出参数**\n",
    "\n",
    "survivied：是否生还"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d9deba71",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np#useless\n",
    "from sklearn.feature_extraction import DictVectorizer#useless\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import make_scorer,accuracy_score#useless\n",
    "from sklearn import preprocessing\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9cc67837",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>survived</th>\n",
       "      <th>name</th>\n",
       "      <th>pclass</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>fare</th>\n",
       "      <th>embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Allen, Miss. Elisabeth Walton</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>29.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>211.3375</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Allison, Master. Hudson Trevor</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>0.9167</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Miss. Helen Loraine</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>2.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mr. Hudson Joshua Creighton</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>25.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   survived                                             name  pclass     sex  \\\n",
       "0         1                    Allen, Miss. Elisabeth Walton       1  female   \n",
       "1         1                   Allison, Master. Hudson Trevor       1    male   \n",
       "2         0                     Allison, Miss. Helen Loraine       1  female   \n",
       "3         0             Allison, Mr. Hudson Joshua Creighton       1    male   \n",
       "4         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)       1  female   \n",
       "\n",
       "       age  sibsp  parch      fare embarked  \n",
       "0  29.0000      0      0  211.3375        S  \n",
       "1   0.9167      1      2  151.5500        S  \n",
       "2   2.0000      1      2  151.5500        S  \n",
       "3  30.0000      1      2  151.5500        S  \n",
       "4  25.0000      1      2  151.5500        S  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_file_path = 'data/titanic3.xls'\n",
    "df_data = pd.read_excel(data_file_path)\n",
    "selected_cols = ['survived','name','pclass','sex','age','sibsp','parch','fare','embarked']\n",
    "selected_df_data = df_data[selected_cols]\n",
    "selected_df_data['embarked'].count()\n",
    "selected_df_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "bcc99289",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 数据处理\n",
    "def prepare_data(df_data):\n",
    "    df = df_data.drop(['name'],axis=1)  # 姓名训练时不需要，去掉\n",
    "    age_mean = df['age'].mean()\n",
    "    df['age'] = df['age'].fillna(age_mean) #缺失的年龄以平均值填充\n",
    "    fare_mean = df['fare'].mean()\n",
    "    df['fare'] = df['fare'].fillna(fare_mean) # 缺失的票价以平均值填充\n",
    "    \n",
    "    df['sex'] = df['sex'].map({'female':0,'male':1}).astype(int) # map映射性别为数字\n",
    "    \n",
    "    df['embarked'] = df['embarked'].fillna('S') #缺失值用最多的值取代\n",
    "    df['embarked'] = df['embarked'].map({'C':0,'Q':1,'S':2}).astype(int)\n",
    "    \n",
    "    narray_data = df.values\n",
    "    features = narray_data[:,1:] # 没有生存情况\n",
    "    label= narray_data[:,0] # 目标类-生存情况\n",
    "    \n",
    "    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))\n",
    "    norm_features = minmax_scale.fit_transform(features) # 归一化\n",
    "    \n",
    "    return norm_features,label # 返回处理后的数据和标签类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "12b399bb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "914"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selected_df_data.groupby(['embarked']).count()\n",
    "selected_df_data['embarked'].value_counts()[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "deb68776",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>survived</th>\n",
       "      <th>name</th>\n",
       "      <th>pclass</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>fare</th>\n",
       "      <th>embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Allen, Miss. Elisabeth Walton</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>29.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>211.3375</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Allison, Master. Hudson Trevor</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>0.9167</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Miss. Helen Loraine</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>2.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mr. Hudson Joshua Creighton</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>25.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   survived                                             name  pclass     sex  \\\n",
       "0         1                    Allen, Miss. Elisabeth Walton       1  female   \n",
       "1         1                   Allison, Master. Hudson Trevor       1    male   \n",
       "2         0                     Allison, Miss. Helen Loraine       1  female   \n",
       "3         0             Allison, Mr. Hudson Joshua Creighton       1    male   \n",
       "4         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)       1  female   \n",
       "\n",
       "       age  sibsp  parch      fare embarked  \n",
       "0  29.0000      0      0  211.3375        S  \n",
       "1   0.9167      1      2  151.5500        S  \n",
       "2   2.0000      1      2  151.5500        S  \n",
       "3  30.0000      1      2  151.5500        S  \n",
       "4  25.0000      1      2  151.5500        S  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selected_df_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "69fb436f",
   "metadata": {},
   "outputs": [],
   "source": [
    "shuffled_df_data = selected_df_data.sample(frac=1) # 打乱顺序\n",
    "x_data,y_data = prepare_data(shuffled_df_data)\n",
    "\n",
    "train_size = int(len(x_data)*0.8)\n",
    "x_train = x_data[:train_size]\n",
    "y_train = y_data[:train_size]\n",
    "x_test = x_data[train_size:]\n",
    "y_test = y_data[train_size:]\n",
    "\n",
    "# x_trian, x_test, y_train, y_test = train_test_split(x_data,y_data,test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "77ba5778",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 0., 1., ..., 0., 0., 1.])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "4853ae34",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1.        , 1.        , 0.27348613, ..., 0.        , 0.01756683,\n",
       "        1.        ],\n",
       "       [0.5       , 1.        , 0.42379934, ..., 0.        , 0.02537431,\n",
       "        1.        ],\n",
       "       [1.        , 0.        , 0.27348613, ..., 0.        , 0.01517579,\n",
       "        1.        ],\n",
       "       ...,\n",
       "       [1.        , 1.        , 0.24843392, ..., 0.        , 0.01546857,\n",
       "        1.        ],\n",
       "       [1.        , 0.        , 0.37220602, ..., 0.        , 0.01512699,\n",
       "        0.5       ],\n",
       "       [0.        , 0.        , 0.33611663, ..., 0.11111111, 0.48312843,\n",
       "        0.        ]])"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "496692b8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>RandomForestClassifier(criterion=&#x27;entropy&#x27;, max_depth=10, n_estimators=200)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">RandomForestClassifier</label><div class=\"sk-toggleable__content\"><pre>RandomForestClassifier(criterion=&#x27;entropy&#x27;, max_depth=10, n_estimators=200)</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "RandomForestClassifier(criterion='entropy', max_depth=10, n_estimators=200)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "srfc = RandomForestClassifier(n_estimators=200,criterion='entropy',max_depth=10)\n",
    "srfc.fit(x_train,y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "3360ac7f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Classification score: 0.7748091603053435\n"
     ]
    }
   ],
   "source": [
    "print('Classification score: {}'.format(srfc.score(x_test,y_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf502463",
   "metadata": {},
   "source": [
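    "### The `classification_report()` function: printing a model evaluation report\n",
    "\n",
    "|Parameter|Description|\n",
    "|:--|:--|\n",
    "|y_true|Ground-truth labels, as a 1-D array (lists, tuples and similar sequences also work)|\n",
    "|y_pred|Predicted labels, as a 1-D array (lists, tuples and similar sequences also work)|\n",
    "|labels|Optional array of label indices to include in the report|\n",
    "|target_names|Optional array of display names matching the labels|\n",
    "|sample_weight|Sample weights, as an array|\n",
    "|digits|Number of digits used to format floating-point values in the output. Defaults to 2. Ignored when `output_dict` is `True`, in which case the returned values are not rounded.|\n",
    "|output_dict|Whether to return a dict. Defaults to `False`; if `True`, the result is returned as a dict.|\n",
    "|zero_division|Value to return when a zero division occurs. Defaults to `\"warn\"`, which behaves like 0 but also raises a warning.|"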
   ]
  },
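  {
   "cell_type": "markdown",
   "id": "b7e41c20",
   "metadata": {},
   "source": [
    "A minimal sketch of how these parameters fit together. The labels and the class names `died` / `survived` are made up for illustration and are not taken from the Titanic data. With `output_dict=True` the report comes back as a nested dict of unrounded floats, so `digits` is only passed to the text version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7e41c21",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
    "# Toy labels, assumed for illustration only (0 = died, 1 = survived)\n",
    "y_true = [0, 0, 0, 1, 1]\n",
    "y_pred = [0, 0, 1, 1, 0]\n",
    "\n",
    "# Text report: target_names replaces the raw labels, digits sets the rounding\n",
    "print(classification_report(y_true, y_pred, labels=[0, 1],\n",
    "                            target_names=['died', 'survived'], digits=3))\n",
    "\n",
    "# Dict report: values are unrounded floats keyed by class name\n",
    "report = classification_report(y_true, y_pred, labels=[0, 1],\n",
    "                               target_names=['died', 'survived'],\n",
    "                               output_dict=True, zero_division=0)\n",
    "print(report['survived']['recall'], report['accuracy'])"
   ]
  },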
  {
   "cell_type": "markdown",
   "id": "39112519",
   "metadata": {},
   "source": [
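    "* In the report, `precision` is the fraction of samples predicted as a class that actually belong to it.\n",
    "* `recall` is the fraction of samples actually belonging to a class that the model found.\n",
    "* `f1-score`, or F1 for short, is the harmonic mean of precision and recall.\n",
    "\n",
    "***\n",
    "<br>\n",
    "\n",
    "**`accuracy` is the overall accuracy: the number of correctly predicted samples divided by the total number of samples. (It is not the accuracy for any single label.)**\n",
    "\n",
    "* `macro avg` is the macro average, i.e. the plain unweighted mean of the per-class values.\n",
    "* macro-P: macro-averaged precision\n",
    "* macro-R: macro-averaged recall\n",
    "* macro-F1: macro-averaged F1\n",
    "\n",
    "***\n",
    "<br>\n",
    "\n",
    "* The corresponding concept is the micro average, `micro avg`.\n",
    "* Taking micro-P as an example: rather than averaging the per-class precisions directly, it pools the underlying counts (TP, FP, TN, FN) over all classes and computes precision from the pooled counts.\n",
    "* micro-R works the same way: the total number of correct predictions across all classes divided by the total number of actual samples."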
   ]
  },
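  {
   "cell_type": "markdown",
   "id": "c3d95a10",
   "metadata": {},
   "source": [
    "The difference between macro and micro averaging can be sketched from a confusion matrix. The labels below are hypothetical (8 samples of class 0, 2 of class 1) and only serve to make the classes imbalanced. The macro average takes the mean of the per-class ratios, while the micro average pools TP, FP and FN first. In single-label classification the pooled precision and recall both equal the overall accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3d95a11",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "# Hypothetical imbalanced labels, assumed for illustration only\n",
    "y_true = np.array([0] * 8 + [1] * 2)\n",
    "y_pred = np.array([0] * 7 + [1, 1, 0])\n",
    "\n",
    "cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class\n",
    "tp = np.diag(cm)                       # correct predictions per class\n",
    "fp = cm.sum(axis=0) - tp               # predicted as the class but wrong\n",
    "fn = cm.sum(axis=1) - tp               # belong to the class but missed\n",
    "\n",
    "# Macro: average the per-class ratios\n",
    "macro_p = np.mean(tp / (tp + fp))\n",
    "macro_r = np.mean(tp / (tp + fn))\n",
    "\n",
    "# Micro: pool the counts over all classes, then take one ratio\n",
    "micro_p = tp.sum() / (tp.sum() + fp.sum())\n",
    "micro_r = tp.sum() / (tp.sum() + fn.sum())\n",
    "\n",
    "print('macro-P', macro_p, 'macro-R', macro_r)\n",
    "print('micro-P', micro_p, 'micro-R', micro_r)"
   ]
  },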
  {
   "cell_type": "markdown",
   "id": "f538e371",
   "metadata": {},
   "source": [
    "https://blog.csdn.net/hfutdog/article/details/88085878"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "ba93bea8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import classification_report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "fbe72a2b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "         0.0       0.79      0.88      0.83       168\n",
      "         1.0       0.73      0.59      0.65        94\n",
      "\n",
      "    accuracy                           0.77       262\n",
      "   macro avg       0.76      0.73      0.74       262\n",
      "weighted avg       0.77      0.77      0.77       262\n",
      "\n"
     ]
    }
   ],
   "source": [
    "y_pre = srfc.predict(x_test)\n",
    "print(classification_report(y_test,y_pre))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7bdce41",
   "metadata": {},
   "source": [
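    "|Field|Meaning|\n",
    "|:--|:--|\n",
    "|support|Number of test samples whose true class is the class of that row. In a five-sample, three-class example with supports 1, 1 and 3, class 0 has a support of 1.|\n",
    "|precision|Precision = TP / (TP + FP): correct predictions for the class divided by all samples predicted as that class. In plain terms, how much of what the model assigned to the class was right.|\n",
    "|recall|Recall = TP / (TP + FN): correct predictions for the class divided by all samples that actually belong to it. In plain terms, how many of the class's test samples were predicted correctly.|\n",
    "|f1-score|`F1 = 2 * precision * recall / (precision + recall)`|\n",
    "|micro avg|The metric computed over all samples pooled together. If 3 of the 5 samples are predicted correctly, micro avg is 3/5 = 0.6.|\n",
    "|macro avg|Unweighted mean of the per-class metric. For precision in the same example: `(0.50 + 0.00 + 1.00)/3 = 0.5`.|\n",
    "|weighted avg|Mean of the per-class metric weighted by support, so classes with more test samples count more. For precision in the same example: `(0.50*1 + 0.0*1 + 1.0*3)/5 = 0.70`.|"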
   ]
  },
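  {
   "cell_type": "markdown",
   "id": "d8f06b30",
   "metadata": {},
   "source": [
    "The example figures in the table (0.50, 0.00, 1.00, 3/5, 0.70) can be reproduced with a short sketch. The five labels below are an assumption: a toy three-class set chosen so that the supports are 1, 1 and 3. They are not Titanic data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8f06b31",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import accuracy_score, precision_recall_fscore_support\n",
    "\n",
    "# Toy labels, assumed for illustration only\n",
    "y_true = [0, 1, 2, 2, 2]\n",
    "y_pred = [0, 0, 2, 2, 1]\n",
    "\n",
    "# Per-class arrays, one entry per class in sorted label order\n",
    "precision, recall, f1, support = precision_recall_fscore_support(\n",
    "    y_true, y_pred, zero_division=0)\n",
    "\n",
    "macro_precision = precision.mean()                           # unweighted mean\n",
    "weighted_precision = np.average(precision, weights=support)  # weighted by support\n",
    "micro = accuracy_score(y_true, y_pred)                       # pooled over all samples\n",
    "\n",
    "print('precision per class:', precision)\n",
    "print('support per class:', support)\n",
    "print('macro:', macro_precision, 'weighted:', weighted_precision, 'micro:', micro)"
   ]
  },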
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "d03ae61b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Jack and Rose:\n",
      "Survived: 0.0\n",
      "Survived: 1.0\n"
     ]
    }
   ],
   "source": [
    "# Predict survival for two new passengers\n",
    "Jack_infor=[0,'Jack',3,'male',23,1,0,5.000,'S']\n",
    "Rose_infor=[1,'Rose',1,'female',20,1,0,100.000,'S']\n",
    "new_passenger_pd = pd.DataFrame([Jack_infor,Rose_infor],columns=selected_cols)\n",
    "# Append the new rows to the existing data (pd.concat is the public replacement for DataFrame._append)\n",
    "all_passenger_pd = pd.concat([selected_df_data, new_passenger_pd])\n",
    "\n",
    "x,y = prepare_data(all_passenger_pd)\n",
    "y_pre = srfc.predict(x[-2:,:])  # the last two rows are Jack and Rose\n",
    "\n",
    "print('Jack and Rose:')\n",
    "for i in range(len(y_pre)):\n",
    "    print('Survived:',y_pre[i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07286ce8",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
